最近,培训预培训方法在以任务为导向的对话框(TOD)系统中表现出了很大的成功。但是,大多数现有的预培训模型用于TOD专注于对话的理解或对话生成,但并非两者兼而有之。在本文中,我们提出了Space-3,这是一种新型的统一的半监督预培训的预训练的对话模型,从大规模对话CORPORA中学习有限的注释,可以有效地对广泛的下游对话任务进行微调。具体而言,Space-3由单个变压器中的四个连续组件组成,以维护TOD系统中的任务流:(i)对话框编码模块编码对话框历史记录,(ii)对话框理解模块以从任一用户中提取语义向量查询或系统响应,(iii)一个对话框策略模块,以生成包含响应高级语义的策略向量,以及(iv)对话框生成模块以产生适当的响应。我们为每个组件设计一个专门的预训练目标。具体而言,我们预先培训对话框编码模块,使用跨度掩码语言建模,以学习上下文化对话框信息。为了捕获“结构化对话框”语义,我们通过额外的对话注释通过新颖的树诱导的半监视对比度学习目标来预先培训对话框理解模块。此外,我们通过将其输出策略向量与响应响应的语义向量之间的L2距离最小化以进行策略优化,从而预先培训对话策略模块。最后,对话框生成模型由语言建模预先训练。结果表明,Space-3在八个下游对话框基准中实现最新性能,包括意图预测,对话框状态跟踪和端到端对话框建模。我们还表明,在低资源设置下,Space-3比现有模型具有更强的射击能力。
translated by 谷歌翻译
具有对比性学习目标的预训练方法在对话了解任务中表现出了显着的成功。但是,当前的对比学习仅将自调查的对话样本视为正样本,并将所有其他对话样本视为负面样本,即使在语义上相关的对话框中,也会强制执行不同的表示。在本文中,我们提出了一个树木结构化的预培训对话模型Space-2,该模型从有限标记的对话框和大规模的无标记的对话框COLPORA通过半监督的对比度预培训来学习对话框表示。具体而言,我们首先定义一个通用的语义树结构(STS),以统一不同对话框数据集的注释模式,以便可以利用所有标记数据中存储的丰富结构信息。然后,我们提出了一个新颖的多视图分数功能,以增加共享类似STS的所有可能对话框的相关性,并且在监督的对比预训练期间仅推开其他完全不同的对话框。为了充分利用未标记的对话,还增加了基本的自我监督对比损失,以完善学习的表示。实验表明,我们的方法可以在DialogLue基准测试中实现新的最新结果,该基准由七个数据集和四个流行的对话框组成。为了获得可重复性,我们在https://github.com/alibabaresearch/damo-convai/tree/main/main/space-2上发布代码和数据。
translated by 谷歌翻译
预先训练的模型已经证明是强大的增强面向任务的对话系统。但是,目前的预训练方法主要关注增强对话的理解和生成任务,同时忽略对话策略的开发。在本文中,我们提出了一个小说预先训练的对话模型,明确地通过半监督学习明确地从有限标记的对话框和大规模未标记的对话框中学习对话策略。具体而言,我们在预训练期间介绍一个对话框预测任务,以便在预训练中进行策略优化,并使用一致性正则化术语在未标记的对话的帮助下优化学习的表示。我们还实施了一个浇注机制来称量合适的未标记对话框样本。经验结果表明,星系大大提高了面向任务为导向的对话系统的性能,并在基准数据集中实现了新的最先进结果:车载,多种多纤2.0和多纺,改善其端到端合并分数2.5,5.3和5.5分。我们还显示Galaxy比各种低资源设置下的现有模型更强大的少量射击能力。
translated by 谷歌翻译
在本文中,我们建议将面向任务导向的对话系统作为纯粹的自然语言生成任务,以便充分利用像GPT-2这样的大规模预训练模型,并简化了复杂的光学化预备。然而,直接应用这种方法严重遭受了通过删除了替代令牌而导致的对话实体不一致,以及在微调期间灾害模型的灾难性遗忘问题,导致表现不令人满意。为了缓解这些问题,我们设计了一种新颖的GPT-Adapter-CopyNet网络,它将轻量级适配器和CopyNet模块包含到GPT-2中,以实现转移学习和对话实体生成的更好性能。在DSTC8轨道1基准和多种数据集上进行的实验结果表明,我们的建议方法显着优于基线模型,在自动和人类评估中具有显着性能。
translated by 谷歌翻译
When using LiDAR semantic segmentation models for safety-critical applications such as autonomous driving, it is essential to understand and improve their robustness with respect to a large range of LiDAR corruptions. In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions. To rigorously evaluate the robustness and generalizability of current approaches, we propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise and cross-device discrepancy. Then, we systematically investigate 11 LiDAR semantic segmentation models, especially spanning different input representations (e.g., point clouds, voxels, projected images, and etc.), network architectures and training schemes. Through this study, we obtain two insights: 1) We find out that the input representation plays a crucial role in robustness. Specifically, under specific corruptions, different representations perform variously. 2) Although state-of-the-art methods on LiDAR semantic segmentation achieve promising results on clean data, they are less robust when dealing with noisy data. Finally, based on the above observations, we design a robust LiDAR segmentation model (RLSeg) which greatly boosts the robustness with simple but effective modifications. It is promising that our benchmark, comprehensive analysis, and observations can boost future research in robust LiDAR semantic segmentation for safety-critical applications.
translated by 谷歌翻译
In recent years, arbitrary image style transfer has attracted more and more attention. Given a pair of content and style images, a stylized one is hoped that retains the content from the former while catching style patterns from the latter. However, it is difficult to simultaneously keep well the trade-off between the content details and the style features. To stylize the image with sufficient style patterns, the content details may be damaged and sometimes the objects of images can not be distinguished clearly. For this reason, we present a new transformer-based method named STT for image style transfer and an edge loss which can enhance the content details apparently to avoid generating blurred results for excessive rendering on style features. Qualitative and quantitative experiments demonstrate that STT achieves comparable performance to state-of-the-art image style transfer methods while alleviating the content leak problem.
translated by 谷歌翻译
With the increasing ability of large language models (LLMs), in-context learning (ICL) has become a new paradigm for natural language processing (NLP), where LLMs make predictions only based on contexts augmented with a few training examples. It has been a new trend exploring ICL to evaluate and extrapolate the ability of LLMs. In this paper, we aim to survey and summarize the progress, challenges, and future work in ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques of ICL, including training strategies, prompting strategies, and so on. Finally, we present the challenges of ICL and provide potential directions for further research. We hope our work can encourage more research on uncovering how ICL works and improving ICL in future work.
translated by 谷歌翻译
Gaze estimation is the fundamental basis for many visual tasks. Yet, the high cost of acquiring gaze datasets with 3D annotations hinders the optimization and application of gaze estimation models. In this work, we propose a novel Head-Eye redirection parametric model based on Neural Radiance Field, which allows dense gaze data generation with view consistency and accurate gaze direction. Moreover, our head-eye redirection parametric model can decouple the face and eyes for separate neural rendering, so it can achieve the purpose of separately controlling the attributes of the face, identity, illumination, and eye gaze direction. Thus diverse 3D-aware gaze datasets could be obtained by manipulating the latent code belonging to different face attributions in an unsupervised manner. Extensive experiments on several benchmarks demonstrate the effectiveness of our method in domain generalization and domain adaptation for gaze estimation tasks.
translated by 谷歌翻译
Generalizability to unseen forgery types is crucial for face forgery detectors. Recent works have made significant progress in terms of generalization by synthetic forgery data augmentation. In this work, we explore another path for improving the generalization. Our goal is to reduce the features that are easy to learn in the training phase, so as to reduce the risk of overfitting on specific forgery types. Specifically, in our method, a teacher network takes as input the face images and generates an attention map of the deep features by a diverse multihead attention ViT. The attention map is used to guide a student network to focus on the low-attended features by reducing the highly-attended deep features. A deep feature mixup strategy is also proposed to synthesize forgeries in the feature domain. Experiments demonstrate that, without data augmentation, our method is able to achieve promising performances on unseen forgeries and highly compressed data.
translated by 谷歌翻译
The development of deep learning models in medical image analysis is majorly limited by the lack of large-sized and well-annotated datasets. Unsupervised learning does not require labels and is more suitable for solving medical image analysis problems. However, most of the current unsupervised learning methods need to be applied to large datasets. To make unsupervised learning applicable to small datasets, we proposed Swin MAE, which is a masked autoencoder with Swin Transformer as its backbone. Even on a dataset of only a few thousand medical images and without using any pre-trained models, Swin MAE is still able to learn useful semantic features purely from images. It can equal or even slightly outperform the supervised model obtained by Swin Transformer trained on ImageNet in terms of the transfer learning results of downstream tasks. The code will be publicly available soon.
translated by 谷歌翻译